stock return
Stochastic Discount Factors with Cross-Asset Spillovers
The central objective of empirical asset pricing is to identify firm-level signals that explain the cross-section of expected stock returns--whether through exposure to risk factors or persistent mispricing. The dominant paradigm, grounded in the assumption of self-predictability, asserts that a firm's own characteristics forecast its own returns (see, e.g., Cochrane (2011); Harvey et al. (2016)). Complementing this view is a growing literature on cross-predictability--the idea that the characteristics or returns of one asset can help forecast the returns of others (see, e.g., Lo and MacKinlay (1990); Hou (2007); Cohen and Frazzini (2008); Cohen and Lou (2012); Huang et al. (2021, 2022)). A key mechanism underpinning this phenomenon is the presence of lead-lag effects, whereby price movements or information from one firm precede and predict those of related firms. Such effects can stem from staggered information diffusion, peer influence within industries, supply chain linkages, or correlated trading by institutional investors that induces price pressure across related assets. Despite recent methodological advances in modeling cross-stock predictability, several foundational questions remain unresolved. Chief among them is how a mean-variance investor can analytically integrate multiple predictive signals when returns are interconnected across assets. Equally crucial is developing a framework that jointly captures both the relevance of individual signals and the structure of return spillovers--enhancing portfolio performance while preserving interpretability.
Extracting the Structure of Press Releases for Predicting Earnings Announcement Returns
Wu, Yuntao, Akin, Ege Mert, Martineau, Charles, Grégoire, Vincent, Veneris, Andreas
We examine how textual features in earnings press releases predict stock returns on earnings announcement days. Using over 138,000 press releases from 2005 to 2023, we compare traditional bag-of-words and BERT-based embeddings. We find that press release content (soft information) is as informative as earnings surprise (hard information), with FinBERT yielding the highest predictive power. Combining models enhances explanatory strength and interpretability of the content of press releases. Stock prices fully reflect the content of press releases at market open. Leaked press releases would offer a predictive advantage. Topic analysis reveals self-serving bias in managerial narratives. Our framework supports real-time return prediction through the integration of online learning, provides interpretability, and reveals the nuanced role of language in price formation.
Deep Reinforcement Learning in Factor Investment
Deep reinforcement learning (DRL) has shown promise in trade execution, yet its use in low-frequency factor portfolio construction remains under-explored. A key obstacle is the high-dimensional, unbalanced state space created by stocks that enter and exit the investable universe. We introduce Conditional Auto-encoded Factor-based Portfolio Optimisation (CAFPO), which compresses stock-level returns into a small set of latent factors conditioned on 94 firm-specific characteristics. The factors feed a DRL agent--implemented with both PPO and DDPG--to generate continuous long-short weights. On 20 years of U.S. equity data (2000-2020), CAFPO outperforms equal-weight, value-weight, Markowitz (historical & factor), vanilla DRL, and Fama-French-driven DRL, delivering a 24.6% compound return and a Sharpe ratio of 0.94 out of sample. SHAP analysis further reveals economically intuitive factor attributions. Our results demonstrate that factor-aware representation learning can make DRL practical for institutional, low-turnover portfolio management.
Machine Learning Classification and Portfolio Allocation: with Implications from Machine Uncertainty
Bai, Yang, Pukthuanthong, Kuntara
We use multi-class machine learning classifiers to identify the stocks that outperform or underperform other stocks. The resulting long-short portfolios achieve annual Sharpe ratios of 1.67 (value-weighted) and 3.35 (equal-weighted), with annual alphas ranging from 29% to 48%. These results persist after controlling for machine learning regressions and remain robust among large-cap stocks. Machine uncertainty, as measured by predicted probabilities, impairs the prediction performance. Stocks with higher machine uncertainty experience lower returns, particularly when human proxies of information uncertainty align with machine uncertainty. Consistent with the literature, such an effect is driven by the past underperformers.
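One way to turn a classifier's predicted probabilities into an uncertainty score is Shannon entropy. The abstract does not say which summary of the probabilities the authors use, so the entropy measure and the function names below are illustrative assumptions rather than the paper's definition.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one stock's predicted class probabilities.

    Assumption: entropy stands in for 'machine uncertainty'; the paper may
    define it differently. Higher entropy means a less decisive prediction.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_class(probs):
    """Index of the most likely class, e.g. 0 = underperform, 2 = outperform."""
    return max(range(len(probs)), key=lambda k: probs[k])

# Hypothetical three-class probabilities for two stocks.
confident = [0.05, 0.05, 0.90]
uncertain = [0.30, 0.35, 0.35]
print(predicted_class(confident), prediction_entropy(confident))
print(predicted_class(uncertain), prediction_entropy(uncertain))
```

Under this sketch, a long-short portfolio would buy stocks whose predicted class is the top class and short those in the bottom class, and the entropy score could be used to sort stocks by how uncertain the classifier is.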
Assessing Uncertainty in Stock Returns: A Gaussian Mixture Distribution-Based Method
Wang, Yanlong, Xu, Jian, Huang, Shao-Lun, Sun, Danny Dongning, Zhang, Xiao-Ping
This study seeks to advance the understanding and prediction of stock market return uncertainty through the application of advanced deep learning techniques. We introduce a novel deep learning model that utilizes a Gaussian mixture distribution to capture the complex, time-varying nature of asset return distributions in the Chinese stock market. By incorporating the Gaussian mixture distribution, our approach effectively characterizes short-term fluctuations and non-traditional features of stock returns, such as skewness and heavy tails, that are often overlooked by traditional models. Compared to GARCH models and their variants, our method demonstrates superior performance in volatility estimation, particularly during periods of heightened market volatility. It provides more accurate volatility forecasts and offers unique risk insights for different assets, thereby deepening the understanding of return uncertainty. Additionally, we propose a novel use of Code embedding which utilizes a bag-of-words approach to train hidden representations of stock codes and transforms the uncertainty attributes of stocks into high-dimensional vectors. These vectors are subsequently reduced to two dimensions, allowing the observation of similarity among different stocks. This visualization facilitates the identification of asset clusters with similar risk profiles, offering valuable insights for portfolio management and risk mitigation. Since we predict the uncertainty of returns by estimating their latent distribution, it is challenging to evaluate the return distribution when the true distribution is unobservable. However, we can measure it through the CRPS to assess how well the predicted distribution matches the true returns, and through MSE and QLIKE metrics to evaluate the error between the volatility level of the predicted distribution and proxy measures of true volatility.
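The volatility evaluation described above can be sketched in a few lines. The code below assumes a one-dimensional Gaussian mixture with hypothetical weights, means, and standard deviations, derives its variance, and scores it against a volatility proxy with MSE and one common form of QLIKE; the paper's exact normalization of QLIKE may differ.

```python
import math

def mixture_moments(weights, means, stds):
    """Mean and variance of a 1-D Gaussian mixture (weights must sum to 1)."""
    mean = sum(w * m for w, m in zip(weights, means))
    # Second moment of each component is std^2 + mean^2.
    second = sum(w * (s ** 2 + m ** 2) for w, m, s in zip(weights, means, stds))
    return mean, second - mean ** 2

def mse(forecast_var, proxy_var):
    """Mean squared error between forecast and proxy variances."""
    n = len(forecast_var)
    return sum((p - f) ** 2 for f, p in zip(forecast_var, proxy_var)) / n

def qlike(forecast_var, proxy_var):
    """QLIKE in the form proxy/forecast - log(proxy/forecast) - 1.

    Assumption: this normalization is zero when forecast equals proxy;
    other papers use equivalent variants that differ by a constant.
    """
    total = 0.0
    for f, p in zip(forecast_var, proxy_var):
        ratio = p / f
        total += ratio - math.log(ratio) - 1.0
    return total / len(forecast_var)

# Hypothetical two-component mixture for one day's return distribution.
mean, var = mixture_moments([0.8, 0.2], [0.001, -0.004], [0.01, 0.03])
print(mean, var, qlike([var], [0.0003]), mse([var], [0.0003]))
```

QLIKE penalizes under-prediction of variance more heavily than over-prediction, which is one reason it is often reported alongside MSE for volatility forecasts.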
Real-time Monitoring of Economic Shocks using Company Websites
Koenig, Michael, Rauch, Jakob, Woerter, Martin
Understanding the effects of economic shocks on firms is critical for analyzing economic growth and resilience. We introduce a Web-Based Affectedness Indicator (WAI), a general-purpose tool for real-time monitoring of economic disruptions across diverse contexts. By leveraging Large Language Model (LLM) assisted classification and information extraction on texts from over five million company websites, WAI quantifies the degree and nature of firms' responses to external shocks. Using the COVID-19 pandemic as a specific application, we show that WAI is highly correlated with pandemic containment measures and reliably predicts firm performance. Unlike traditional data sources, WAI provides timely firm-level information across industries and geographies worldwide that would otherwise be unavailable due to institutional and data availability constraints. This methodology offers significant potential for monitoring and mitigating the impact of technological, political, financial, health or environmental crises, and represents a transformative tool for adaptive policy-making and economic resilience.
Economic shocks, whether driven by public health crises, technological disruptions, geopolitical conflicts, or climate events, pose significant challenges to businesses and policymakers alike. Timely and accurate monitoring of these shocks is critical for crafting effective responses and enhancing economic resilience. However, traditional methods for measuring the impacts of such disruptions - such as surveys and administrative data - are often limited by costs, time lags, and coverage. In this study, we introduce the Web-Based Affectedness Indicator (WAI), a scalable and cost-effective tool for real-time monitoring of economic disruptions at the firm level. By analyzing textual data from millions of company websites, WAI provides granular insights into how firms experience and respond to external shocks. This methodology overcomes traditional limitations by leveraging ubiquitous online content and state-of-the-art natural language processing (NLP) models to generate a dynamic and comprehensive view of economic affectedness. WAI can provide information on a wide range of challenges, including supply chain disruptions, financial crises, and climate-related shocks.
FactorGCL: A Hypergraph-Based Factor Model with Temporal Residual Contrastive Learning for Stock Returns Prediction
Duan, Yitong, Wang, Weiran, Li, Jian
As a fundamental method in economics and finance, the factor model has been extensively utilized in quantitative investment. In recent years, there has been a paradigm shift from traditional linear models with expert-designed factors to more flexible nonlinear machine learning-based models with data-driven factors, aiming to enhance the effectiveness of these factor models. However, due to the low signal-to-noise ratio in market data, mining effective factors in data-driven models remains challenging. In this work, we propose a hypergraph-based factor model with temporal residual contrastive learning (FactorGCL) that employs a hypergraph structure to better capture high-order nonlinear relationships among stock returns and factors. To mine hidden factors that supplement human-designed prior factors for predicting stock returns, we design a cascading residual hypergraph architecture, in which the hidden factors are extracted from the residual information after removing the influence of prior factors. Additionally, we propose a temporal residual contrastive learning method to guide the extraction of effective and comprehensive hidden factors by contrasting stock-specific residual information over different time periods. Our extensive experiments on real stock market data demonstrate that FactorGCL not only outperforms existing state-of-the-art methods but also mines effective hidden factors for predicting stock returns.
Regression and Forecasting of U.S. Stock Returns Based on LSTM
Zhou, Shicheng, Zhang, Zizhou, Zhang, Rong, Yin, Yuchen, Chang, Chia Hong, Shen, Qinyan
This paper analyses the investment returns of three stock sectors, Manuf, Hitec, and Other, in the U.S. stock market, based on the Fama-French three-factor model, the Carhart four-factor model, and the Fama-French five-factor model, in order to test the validity of these models for the three sectors of the market. Also, the LSTM model is used to explore the additional factors affecting stock returns. The empirical results show that the Fama-French five-factor model has better validity for the three segments of the market under study, and that the LSTM model can capture the factors affecting the returns of certain industries and can better regress and predict the stock returns of the relevant industries. Keywords: Fama-French model; Carhart model; Factor model; LSTM model.
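For reference, the three specifications named above are usually written as time-series regressions of a portfolio's excess return on factor returns. The notation below follows the standard textbook form and is not taken from the paper itself; the index $i$ would stand for one of the sectors (Manuf, Hitec, Other).

```latex
% Fama-French three-factor model
R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + s_i \,\mathrm{SMB}_t + h_i \,\mathrm{HML}_t + \varepsilon_{i,t}

% Carhart four-factor model: adds a momentum factor
R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + s_i \,\mathrm{SMB}_t + h_i \,\mathrm{HML}_t + m_i \,\mathrm{MOM}_t + \varepsilon_{i,t}

% Fama-French five-factor model: adds profitability and investment factors
R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + s_i \,\mathrm{SMB}_t + h_i \,\mathrm{HML}_t + r_i \,\mathrm{RMW}_t + c_i \,\mathrm{CMA}_t + \varepsilon_{i,t}
```

In this framing, a model is typically judged valid for a sector when the estimated intercept $\alpha_i$ is statistically indistinguishable from zero and the factor loadings explain a large share of return variation.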
Sentiment trading with large language models
We investigate the efficacy of large language models (LLMs) in sentiment analysis of U.S. financial news and their potential in predicting stock market returns. We analyze a dataset comprising 965,375 news articles that span from January 1, 2010, to June 30, 2023; we focus on the performance of various LLMs, including BERT, OPT, FINBERT, and the traditional Loughran-McDonald dictionary model, which has been a dominant methodology in the finance literature. The study documents a significant association between LLM scores and subsequent daily stock returns. Specifically, OPT, which is a GPT-3 based LLM, shows the highest accuracy in sentiment prediction with an accuracy of 74.4%, slightly ahead of BERT (72.5%) and FINBERT (72.2%). In contrast, the Loughran-McDonald dictionary model demonstrates considerably lower effectiveness with only 50.1% accuracy. Regression analyses highlight a robust positive impact of OPT model scores on next-day stock returns, with coefficients of 0.274 and 0.254 in different model specifications. BERT and FINBERT also exhibit predictive relevance, though to a lesser extent. Notably, we do not observe a significant relationship between the Loughran-McDonald dictionary model scores and stock returns, challenging the efficacy of this traditional method in the current financial context. In portfolio performance, the long-short OPT strategy excels with a Sharpe ratio of 3.05, compared to 2.11 for BERT and 2.07 for FINBERT long-short strategies. Strategies based on the Loughran-McDonald dictionary yield the lowest Sharpe ratio of 1.23. Our findings emphasize the superior performance of advanced LLMs, especially OPT, in financial market prediction and portfolio management, marking a significant shift in the landscape of financial analysis tools with implications for financial regulation and policy analysis.
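The Sharpe ratios quoted above summarize long-short strategy returns. A minimal sketch of how such a figure is commonly computed from daily returns follows; the 252-day annualization and the sample numbers are assumptions for illustration, and the abstract does not state the authors' exact convention.

```python
import math
import statistics

def long_short_returns(long_leg, short_leg):
    """Daily return of a self-financing portfolio: long leg minus short leg."""
    return [l - s for l, s in zip(long_leg, short_leg)]

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample standard deviation, scaled by sqrt(periods per year).

    Assumption: no risk-free rate is subtracted, since a long-short
    portfolio is treated as zero-investment here.
    """
    mean = statistics.mean(daily_returns)
    std = statistics.stdev(daily_returns)
    return math.sqrt(periods_per_year) * mean / std

# Hypothetical daily returns for stocks with positive and negative sentiment scores.
long_leg = [0.004, -0.001, 0.006, 0.002, 0.003]
short_leg = [0.001, 0.000, 0.002, -0.001, 0.002]
spread = long_short_returns(long_leg, short_leg)
print(annualized_sharpe(spread))
```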
Modeling High-Dimensional Dependent Data in the Presence of Many Explanatory Variables and Weak Signals
This article considers a novel and widely applicable approach to modeling high-dimensional dependent data when a large number of explanatory variables are available and the signal-to-noise ratio is low. We postulate that a $p$-dimensional response series is the sum of a linear regression with many observable explanatory variables and an error term driven by some latent common factors and an idiosyncratic noise. The common factors have dynamic dependence whereas the covariance matrix of the idiosyncratic noise can have diverging eigenvalues to handle the situation of low signal-to-noise ratio commonly encountered in applications. The regression coefficient matrix is estimated using penalized methods when the dimensions involved are high. We apply factor modeling to the regression residuals, employ a high-dimensional white noise testing procedure to determine the number of common factors, and adopt a projected Principal Component Analysis when the signal-to-noise ratio is low. We establish asymptotic properties of the proposed method, both for fixed and diverging numbers of regressors, as $p$ and the sample size $T$ approach infinity. Finally, we use simulations and empirical applications to demonstrate the efficacy of the proposed approach in finite samples.
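The model structure described in words above can be written compactly. The symbols $B$, $A$, $x_t$, $f_t$, $m$, and $r$ below are notation assumed here for illustration; the abstract itself fixes only $p$ and $T$.

```latex
% Response = regression on observed covariates + factor-driven error
y_t = B x_t + \eta_t, \qquad \eta_t = A f_t + \varepsilon_t, \qquad t = 1, \dots, T,

% where
%   y_t \in \mathbb{R}^p            is the response series,
%   x_t \in \mathbb{R}^m            collects the observable explanatory variables,
%   B   \in \mathbb{R}^{p \times m} is the regression coefficient matrix (estimated with a penalty when dimensions are high),
%   f_t \in \mathbb{R}^r            are latent common factors with dynamic dependence,
%   A   \in \mathbb{R}^{p \times r} is the factor loading matrix,
%   \varepsilon_t                   is idiosyncratic noise whose covariance may have diverging eigenvalues.
```

Read this way, the procedure is two-stage: estimate $B$ first, then apply factor modeling to the residuals $y_t - \hat{B} x_t$ to recover the number of factors and the factor structure.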